ABSTRACT

Research shows that comment spamming (comments that are unsolicited,
unrelated, abusive, hateful, commercial advertisements, etc.) in online
discussion forums has become a common phenomenon in Web 2.0
applications, and there is a strong need to counter it. We present a
method to automatically detect comment spammers in YouTube (the
largest, and a very popular, video sharing website) forums. The
proposed technique is based on mining the comment activity log of a
user and extracting patterns (such as the time interval between
subsequent comments, or the presence of exactly the same comment
across multiple unrelated videos) that indicate spam behavior. We
perform an empirical analysis on data crawled from YouTube and
demonstrate that the proposed method is effective for the task of
comment spammer detection.

Categories and Subject Descriptors 

H.3.3 [Information Search and Retrieval]: [Information filtering]

General Terms 

Experimentation, Measurement 

Keywords 

Spam detection, comment spam identification, YouTube, usage data
analysis, pattern recognition, user behavioral analysis, online
discussion forums

1. RESEARCH MOTIVATION AND AIM

Spam in domains such as email, web pages, blogs, social networking
websites, online discussion forums, wikis and video sharing websites is
prevalent and has several negative impacts: it consumes computing
resources, lowers the reputation or value of the targeted legitimate
web application, distorts search engine rankings, overwhelms
moderators and administrators, and obstructs and misleads genuine
usage by legitimate users and the community [7, 8, 3]. Previous
studies show that comment spam in online discussion forums (the focus
of this paper) is prevalent, and techniques to counter this type of
spam have attracted the attention of several researchers
[3, 4, 5, 10, 11]. Several content-based methods have been proposed
to automatically identify spam comments. Content-based methods analyze
the text of a post or message in a forum (for example, checking for
the presence of pre-defined terms or links) and infer the likelihood
of the message being spam or legitimate.
While content-based methods have shown encouraging results, they are
not perfect, and there is a strong need to augment or complement
existing content-based anti-spam methods. Based on our analysis of the
related literature, analyzing the commenting behavior or activity of a
user to identify spammers (a user classification task, not a post or
message classification task) is relatively unexplored in contrast to
content-based methods. We hypothesize that examining the commenting
activity of a user (usage analysis and characterization) can play a
role in identifying spammers. The broad research objective of this
paper is to investigate the application of usage-based features
derived from a user's comment activity (by analyzing a log of recent
comments with associated metadata) to identify comment spammers. The
specific research aim is to investigate techniques for mining
usage-based discriminatory patterns and markers to identify comment
spammers in YouTube forums (YouTube being the largest, and a very
popular, video sharing website on the Internet).

Figure 1: High-level system architecture and processing pipeline for
identifying comment spammers in YouTube. ATDC: Average Time Difference
between Comments, PCHF: Percentage of Comments having hasSpamHint
Flag, CRAV: Comment Repeatability Across Videos, CRR: Comment
Repetition and Redundancy.

Figure 2: Plot of users in the evaluation dataset across two
dimensions: spam percentage and number of comments.

Figure 3: Plot of users in the evaluation dataset across two
dimensions: spam percentage and comment repetition and redundancy
(CRR).

2. RESEARCH CONTRIBUTIONS 

Heymann et al. present a survey of approaches for fighting spam on
social websites [9]. Hayati presents an evaluation and analysis of
Web 2.0 anti-spam methods [7]. Benevenuto et al. provide a general
overview of pollution in video sharing systems such as YouTube
(evidence of pollution, types of pollution, effect on the system and
control strategies) [1]. Yo-Sub Han et al. present an algorithm to
evaluate the reputation of a user in YouTube by mining the user's
social activity and interactions (such as subscriptions and uploaded
content) [6]. Benevenuto et al. introduce a technique to detect video
spammers in YouTube [2]. The closely related studies by Yo-Sub Han et
al. and Benevenuto et al. resemble this paper in that all three
perform a user classification task in YouTube (automatic user
reputation determination, video spammer identification, and comment
spammer detection, respectively). The main difference is that in this
paper we explore certain commenting activities and attributes of a
user (novel in the context of current solutions) to estimate the
likelihood that the user is a forum spammer. In the context of closely
related work, this paper makes the following novel contributions. It
presents the first study (on YouTube) of mining the recent activity
log of a user to extract usage-based features (particularly high
comment repeatability, the presence of exactly the same comment across
videos, ultra-low time differences between comments, and a large
number of spam tags assigned by the community or moderators) to
identify spammers. It presents an empirical study on a dataset crawled
from YouTube and demonstrates that the proposed usage-based and
behavioral features can be used as markers for automatically detecting
comment spammers in YouTube forums. It also offers fresh insights into
the characteristics of comment spammers on YouTube.

3. SOLUTION APPROACH 

Figure 1 presents a high-level solution framework and the key
components of the proposed system. YouTube discussion forums (threaded
discussions in response to an uploaded video) have a feature in which
comments are marked as hasSpamHint. We manually and visually inspected
the forums of several videos across various categories and noticed
many comments correctly tagged as hasSpamHint (the comments remain
visible). However, we also noticed a significant percentage of spam
comments that are not tagged as hasSpamHint (perhaps because it is
practically infeasible for administrators to manually analyze very
large volumes of comments). Furthermore, hasSpamHint tagging is
performed at the comment level and not at the user level.

The proposed approach first retrieves the comments marked with
hasSpamHint for a given video. We then extract the user IDs behind the
spam comments. The YouTube API^1 provides functions to retrieve the
recent commenting activity (a log of comments and the associated
metadata) of a given user. As shown in Figure 1, we extract several
comment attributes from the discussion-forum usage log: the text of
the comment, the timestamp, the VideoID of the video commented on, and
the value of the binary variable hasSpamHint. The next step computes
the values of variables indicating the spam intention of the user. We
define four indicators and describe the intuition (and design
justification) behind each. The values of the following four
indicators (heuristics) are then used to score a given user as a
comment spammer.
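The first two steps (collect the flagged comments of a video, then
extract the user IDs behind them) can be sketched as follows. This is
a minimal illustration only: the Comment record and its field names
are hypothetical stand-ins for the four attributes listed above plus a
user ID, and no actual YouTube API call is shown.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Comment:
    """One comment-log entry; field names are illustrative assumptions."""
    user_id: str
    video_id: str
    text: str
    timestamp: datetime
    has_spam_hint: bool


def flagged_user_ids(video_comments):
    """Return IDs of users with at least one flagged comment, in first-seen order."""
    seen = {}
    for comment in video_comments:
        if comment.has_spam_hint:
            # dict preserves insertion order, so duplicates are dropped stably
            seen.setdefault(comment.user_id, None)
    return list(seen)
```

Each returned user ID would then be used to fetch that user's recent
comment log, from which the four indicators are computed.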

3.1 ATDC 

Average Time Difference between Comments (ATDC): We extract all the
recent comments by a user (the number of comments that can be
retrieved is limited by the YouTube API) and compute the time
differences between all the comments (comparing each comment with
every other comment in the log). We compute the average time
difference and record the value. We hypothesize that a low value of
ATDC signals spam. Our conjecture is based on the observation that
spammers often employ automated scripts or spam robots to post
comments, as a result of which the time difference between subsequent
comments is so low that it is not feasible manually. We confirm the
presence of this phenomenon (in the evaluation dataset and through our
manual inspection of several YouTube forums), wherein the time
interval between comments in a sequence is less than a few seconds.

1 http://code.google.com/apis/youtube/overview.html
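A minimal sketch of the ATDC computation as described above, averaging
the absolute time difference over every pair of comments in the log.
The function name, the choice of seconds as the unit and the handling
of logs with fewer than two comments are assumptions made for
illustration.

```python
from datetime import datetime
from itertools import combinations


def atdc_seconds(timestamps):
    """Mean absolute time difference (seconds) over all pairs of comments."""
    pairs = list(combinations(timestamps, 2))
    if not pairs:
        return None  # fewer than two comments: the average is undefined
    total = sum(abs((a - b).total_seconds()) for a, b in pairs)
    return total / len(pairs)


# Three comments posted 10 seconds apart.
log = [datetime(2011, 1, 1, 0, 0, s) for s in (0, 10, 20)]
print(atdc_seconds(log))
```

Because every comment is compared with every other, the cost grows
quadratically with the log length; this is acceptable here since the
number of retrievable comments per user is limited.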

3.2 PCHF 

Percentage of Comments with hasSpamHint Flag (PCHF): We compute the
percentage of a user's comments marked as hasSpamHint. We hypothesize
that a significant percentage of comments by a user marked as
hasSpamHint can be used as a signal for classifying the user as a
spammer. We confirm the prevalence of this phenomenon in YouTube user
comment logs: several users who are spammers (validated by manual
inspection) exhibit a high PCHF value.
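PCHF reduces to a simple ratio. The sketch below assumes the flags are
available as a sequence of booleans, one per comment in the user's
log, and reports the value on a 0 to 100 scale; the function name and
the treatment of an empty log are illustrative choices.

```python
def pchf(has_spam_hint_flags):
    """Percentage (0-100) of a user's comments carrying the hasSpamHint flag."""
    flags = list(has_spam_hint_flags)
    if not flags:
        return None  # no comments: the percentage is undefined
    flagged = sum(1 for flag in flags if flag)
    return 100.0 * flagged / len(flags)


print(pchf([True, True, False, True]))  # 75.0
```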

3.3 CRAV 

Comment Repeatability Across Videos (CRAV): We observe a phenomenon
(which we exploit as an attribute and heuristic for the spammer
categorization task) wherein a user posts exactly the same comment in
the discussion forums accompanying several different videos. Visual
inspection shows that a user posting the same message across several
videos is a case of content promotion and is a reliable comment
activity marker for identifying spammers. High variability in video
IDs combined with high similarity among the comments posted by a user
is employed as an indicator in the proposed solution.
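The text does not give an explicit formula for CRAV, so the following
is only one possible way to quantify it, under a loudly stated
assumption: we score a user by the fraction of comments whose exact
text also appears under at least one other video in the same log. The
function name and the scoring rule are hypothetical.

```python
from collections import defaultdict


def crav(comments):
    """Fraction of comments whose exact text appears on more than one video.

    `comments` is an iterable of (video_id, text) pairs from one user's log.
    This scoring rule is an illustrative assumption, not the paper's formula.
    """
    comments = list(comments)
    if not comments:
        return None
    videos_by_text = defaultdict(set)
    for video_id, text in comments:
        videos_by_text[text].add(video_id)
    repeated = sum(1 for _, text in comments if len(videos_by_text[text]) > 1)
    return repeated / len(comments)


log = [("v1", "buy now"), ("v2", "buy now"), ("v3", "buy now"), ("v1", "nice song")]
print(crav(log))  # 0.75
```

Under this definition, a message repeated only on a single video does
not raise the score; that case is covered by CRR.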

3.4 CRR 

Comment Repetition and Redundancy (CRR): We observe a pattern wherein
a user simply repeats the same message on the same video (sometimes
within a small time interval and sometimes spread reasonably widely
over time, but still the same message). We hypothesize that a high
value of CRR signals spam. We compare the text of every comment with
the text of all other comments in the log of recent comments posted by
a user (1 for an exact match and 0 for a non-match) and compute the
average CRR value.
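A minimal sketch of the CRR computation as described: every pair of
comment texts in the log scores 1 for an exact match and 0 otherwise,
and the scores are averaged. The function name and the handling of
logs with fewer than two comments are assumptions. The sketch compares
all comments in the log regardless of video, following the computation
described above.

```python
from itertools import combinations


def crr(texts):
    """Average exact-match score (1 or 0) over all pairs of comment texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return None  # fewer than two comments: nothing to compare
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)


print(crr(["check my channel", "check my channel", "check my channel"]))  # 1.0
```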

4. EMPIRICAL ANALYSIS 

We extract the comment activity logs of 240 unique users, consisting
of 13000 comments, from some of the top rated and most viewed videos
on YouTube. Figures 2, 3, 4, 5 and 6 plot 119 users (those with more
than 5 comments) across multiple dimensions and attributes. Figure 2
reveals that there are several users with more than 20 comments who
have more than 50% of their comments tagged with the hasSpamHint flag.

We observe several users with more than 60 comments, more than 70% of
which were marked as hasSpamHint. Figure 3 provides a different
perspective, plotting each user by spam percentage and CRR (comment
repetition and redundancy). Figure 3 clearly shows that users A, B, C
and D have posted at least 30 comments (A 35, B 118, C 36 and D 30),
have a CRR value of more than 0.7 (which means posting the same
comment multiple times) and have 80% of their comments marked as spam
by the moderator. Users in the top right corner of Figures 2 and 3 are
potential spammers. We manually inspect such users and confirm the
hypothesis. Table 1 shows an illustrative list of comments of some of
the users identified as spammers by the proposed approach.

Table 1: An illustrative list of comments of some of the users
identified as spammers in the experimental dataset

Figure 6: Plot of users in the evaluation dataset across three
dimensions: log of average time difference between comments (ATDC),
number of comments and spam percentage.

Figure 4: Plot of users in the evaluation dataset across two
dimensions: video overlap and comment repetition and redundancy (CRR).

Figure 5: Plot of users in the evaluation dataset across two
dimensions: log of average time difference between comments (ATDC) and
number of comments.

Figure 4 reveals users having high comment overlap and low video
overlap (several similar comments posted on a single video or a small
set of videos) as well as users having high comment overlap and high
video overlap (a user posting exactly the same comments across
multiple videos). Figure 5 reveals users posting a large number of
comments in a small time interval (the y-axis is the log of the
average time difference between comments, in seconds). Users in the
bottom right corner of Figure 5 (below 3 on the y-axis and above 20 on
the x-axis) are potential spammers. Figure 6 plots the users across
three dimensions. Based on our manual inspection of the data, we
derive the following rule to automatically classify a vector
representing a YouTube user's behavior across four dimensions (PCHF,
ATDC, COMOVP and VIDOVP, where COMOVP and VIDOVP denote comment
overlap and video overlap). All users satisfying the rule (with more
than 5 comments as the minimum threshold) were manually annotated as
spammers.

SPAMMER = (PCHF > 70) OR (ATDC < 150) 
OR (COMOVP > 0.60) OR (VIDOVP > 0.60) 
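The rule above can be written as a small predicate. This is a sketch
under stated assumptions: the four values are passed in precomputed
(the text does not define how COMOVP and VIDOVP are calculated), PCHF
is on a 0 to 100 scale, ATDC uses the same unit as the threshold of
150, and returning None for users at or below the minimum comment
threshold is our own illustrative choice.

```python
def is_spammer(pchf, atdc, comovp, vidovp, n_comments, min_comments=5):
    """Apply the disjunctive rule; None means too few comments to decide."""
    if n_comments <= min_comments:
        return None  # the rule is only applied to users with > 5 comments
    return pchf > 70 or atdc < 150 or comovp > 0.60 or vidovp > 0.60


print(is_spammer(pchf=80, atdc=1000, comovp=0.1, vidovp=0.1, n_comments=10))  # True
print(is_spammer(pchf=10, atdc=1000, comovp=0.1, vidovp=0.1, n_comments=10))  # False
```

Because the conditions are joined with OR, exceeding any single
threshold is enough to label a user as a spammer.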

A manual inspection of user profiles and comments demonstrates that
comment spammers are prevalent in YouTube forums and that the proposed
heuristics (based on testing for the presence of pre-defined spam
indicators or markers in a user's comment activity log) are reliable
for spammer detection (refer to Table 1; a manual inspection of the
comments posted by the identified users clearly indicates spam).

5. CONCLUSIONS 

We describe a rule-based method to automatically identify comment
spammers in YouTube forums by mining the comment activity logs of
users. Applying the proposed method to a sample dataset reveals that
the technique is effective in identifying spammers. We hypothesize
certain characteristics of comment spammers and perform an empirical
study to test these hypotheses. Our findings indicate that attributes
such as a large number of exactly identical comments on a single video
or across multiple videos, very small time intervals between
subsequent comments, and a large percentage of comments carrying the
spam hint flag are reliable indicators for categorizing YouTube forum
spammers.